Introduction to Open Data Science - Course Project

About the project

I initially found out about this course thanks to a good friend and colleague. The subjects in this course interest me a lot, and I believe that science should be shared as much as possible. However, I lack some knowledge and experience in that respect. My most important contributions, the multicast and linear optimization extensions of NCorg DNC (previously the DiscoDNC tool), are shared on my old GitHub profile. However, I believe this was done in a limited manner; more specifically, I doubt that another researcher could reproduce my experiments without a lot of effort on their part.

Furthermore, I believe that my data analysis has improved quite a lot over time. However, I have never had a focused lecture on how to analyse and share the results of experiments, so I am quite excited about the data science portion of this course.

Regarding the crash course on R: as I mentioned in my introduction (in Moodle), I used R during my master's studies. In fact, that was when I worked on the NCorg DNC tool mentioned above. Therefore, the crash course was not strictly necessary for me, as I already knew the basics. What was new to me was RStudio, and I am quite positively surprised by it, actually. My initial experiences were negative, and most of my previous work was done in the Sublime Text editor, running Rscript directly. The last class and exercises made me see the potential of this programming environment, and I am looking forward to future exercises to see how this tool can help me.

I initially wrote my own version of this assignment. However, after looking at Exercise 2 I decided to follow it instead, to make sure I covered everything correctly. You can see my previous version through my previous commit.

This set consists of a few numbered exercises. Go to each exercise in turn and do as follows:

  1. Read the brief description of the exercise.
  2. Run the (possible) pre-exercise-code chunk.
  3. Follow the instructions to fix the R code!

2.0 INSTALL THE REQUIRED PACKAGES FIRST!

One or more extra packages (in addition to tidyverse) will be needed below.

# Select (with mouse or arrow keys) the install.packages("...") and
# run it (by Ctrl+Enter / Cmd+Enter):

# install.packages("GGally")

2.1 Reading data from the web

The first step of data analysis with R is reading data into R. This is done with a function; which function and which arguments to use depends on the original format of the data.

Conveniently in R, the same functions for reading data can usually be used whether the data is saved locally on your computer or somewhere else behind a web URL.

After the correct function has been identified and the data read into R, the data will usually be in R's data.frame format. The dimensions of a data frame are (n, d), where n is the number of rows (the observations) and d the number of columns (the variables).

The purpose of this course is to expose you to some basic and more advanced tasks of programming and data analysis with R.

Instructions

  • Read the lrn14 data frame to memory with read.table(). There is information related to the data here
  • Use dim() on the data frame to look at the dimensions of the data. How many rows and columns does the data have?
  • Look at the structure of the data with str().

Hint: - For both functions you can pass a data frame as the first (unnamed) argument.

R code

# This is a code chunk in RStudio editor.
# Work with the exercise in this chunk, step-by-step. Fix the R code!

# read the data into memory
lrn14 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t", header=TRUE)

# Look at the dimensions of the data
dim(lrn14)
## [1] 183  60
# Look at the structure of the data
str(lrn14)
## 'data.frame':    183 obs. of  60 variables:
##  $ Aa      : int  3 2 4 4 3 4 4 3 2 3 ...
##  $ Ab      : int  1 2 1 2 2 2 1 1 1 2 ...
##  $ Ac      : int  2 2 1 3 2 1 2 2 2 1 ...
##  $ Ad      : int  1 2 1 2 1 1 2 1 1 1 ...
##  $ Ae      : int  1 1 1 1 2 1 1 1 1 1 ...
##  $ Af      : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ ST01    : int  4 4 3 3 4 4 5 4 4 4 ...
##  $ SU02    : int  2 2 1 3 2 3 2 2 1 2 ...
##  $ D03     : int  4 4 4 4 5 5 4 4 5 4 ...
##  $ ST04    : int  4 4 4 4 3 4 2 5 5 4 ...
##  $ SU05    : int  2 4 2 3 4 3 2 4 2 4 ...
##  $ D06     : int  4 2 3 4 4 5 3 3 4 4 ...
##  $ D07     : int  4 3 4 4 4 5 4 4 5 4 ...
##  $ SU08    : int  3 4 1 2 3 4 4 2 4 2 ...
##  $ ST09    : int  3 4 3 3 4 4 2 4 4 4 ...
##  $ SU10    : int  2 1 1 1 2 1 1 2 1 2 ...
##  $ D11     : int  3 4 4 3 4 5 5 3 4 4 ...
##  $ ST12    : int  3 1 4 3 2 3 2 4 4 4 ...
##  $ SU13    : int  3 3 2 2 3 1 1 2 1 2 ...
##  $ D14     : int  4 2 4 4 4 5 5 4 4 4 ...
##  $ D15     : int  3 3 2 3 3 4 2 2 3 4 ...
##  $ SU16    : int  2 4 3 2 3 2 3 3 4 4 ...
##  $ ST17    : int  3 4 3 3 4 3 4 3 4 4 ...
##  $ SU18    : int  2 2 1 1 1 2 1 2 1 2 ...
##  $ D19     : int  4 3 4 3 4 4 4 4 5 4 ...
##  $ ST20    : int  2 1 3 3 3 3 1 4 4 2 ...
##  $ SU21    : int  3 2 2 3 2 4 1 3 2 4 ...
##  $ D22     : int  3 2 4 3 3 5 4 2 4 4 ...
##  $ D23     : int  2 3 3 3 3 4 3 2 4 4 ...
##  $ SU24    : int  2 4 3 2 4 2 2 4 2 4 ...
##  $ ST25    : int  4 2 4 3 4 4 1 4 4 4 ...
##  $ SU26    : int  4 4 4 2 3 2 1 4 4 4 ...
##  $ D27     : int  4 2 3 3 3 5 4 4 5 4 ...
##  $ ST28    : int  4 2 5 3 5 4 1 4 5 2 ...
##  $ SU29    : int  3 3 2 3 3 2 1 2 1 2 ...
##  $ D30     : int  4 3 4 4 3 5 4 3 4 4 ...
##  $ D31     : int  4 4 3 4 4 5 4 4 5 4 ...
##  $ SU32    : int  3 5 5 3 4 3 4 4 3 4 ...
##  $ Ca      : int  2 4 3 3 2 3 4 2 3 2 ...
##  $ Cb      : int  4 4 5 4 4 5 5 4 5 4 ...
##  $ Cc      : int  3 4 4 4 4 4 4 4 4 4 ...
##  $ Cd      : int  4 5 4 4 3 4 4 5 5 5 ...
##  $ Ce      : int  3 5 3 3 3 3 4 3 3 4 ...
##  $ Cf      : int  2 3 4 4 3 4 5 3 3 4 ...
##  $ Cg      : int  3 2 4 4 4 5 5 3 5 4 ...
##  $ Ch      : int  4 4 2 3 4 4 3 3 5 4 ...
##  $ Da      : int  3 4 1 2 3 3 2 2 4 1 ...
##  $ Db      : int  4 3 4 4 4 5 4 4 2 4 ...
##  $ Dc      : int  4 3 4 5 4 4 4 4 4 4 ...
##  $ Dd      : int  5 4 1 2 4 4 5 3 5 2 ...
##  $ De      : int  4 3 4 5 4 4 5 4 4 2 ...
##  $ Df      : int  2 2 1 1 2 3 1 1 4 1 ...
##  $ Dg      : int  4 3 3 5 5 4 4 4 5 1 ...
##  $ Dh      : int  3 3 1 4 5 3 4 1 4 1 ...
##  $ Di      : int  4 2 1 2 3 3 2 1 4 1 ...
##  $ Dj      : int  4 4 5 5 3 5 4 5 2 4 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ gender  : chr  "F" "M" "F" "M" ...

2.2 Scaling variables

The next step is wrangling the data into a format that is easy to analyze. We will wrangle our data for the next few exercises.

A neat thing about R is that many operations are vectorized: a single operation can affect all elements of a vector. This is often convenient.

The column Attitude in lrn14 is the sum of 10 questions related to students' attitude towards statistics, each measured on the Likert scale (1-5). Here we'll scale the combination variable back to the 1-5 scale.

Instructions

  • Execute the example codes to see how vectorized division works
  • Use vector division to create a new column attitude in the lrn14 data frame, where each observation of Attitude is scaled back to the original scale of the questions by dividing it by the number of questions.

Hint: - Assign 'Attitude divided by 10' to the new column 'attitude'.

R code

# This is a code chunk in RStudio editor.
# Work with the exercise in this chunk, step-by-step. Fix the R code!

#lrn14 is available

# divide each number in a vector
c(1,2,3,4,5) / 2
## [1] 0.5 1.0 1.5 2.0 2.5
# print the "Attitude" column vector of the lrn14 data
lrn14$Attitude
##   [1] 37 31 25 35 37 38 35 29 38 21 39 38 24 30 26 25 41 26 26 17 27 39 34 27 23
##  [26] 37 44 41 24 37 25 30 34 32 20 24 42 16 31 38 38 33 17 25 32 35 32 42 31 39
##  [51] 19 21 25 32 32 26 23 38 28 33 48 40 40 47 23 31 27 41 34 34 25 21 14 19 37
##  [76] 41 32 28 41 25 28 38 31 35 36 26 44 45 32 20 39 25 33 35 33 30 29 33 33 35
## [101] 36 42 37 28 42 22 32 50 47 36 36 29 35 40 35 32 26 20 27 32 33 39 33 30 37
## [126] 14 30 25 29 39 36 29 21 31 24 40 31 23 28 37 26 24 30 29 32 28 29 24 31 19
## [151] 20 38 34 37 29 23 41 27 35 34 32 33 33 35 32 31 24 28 17 19 32 35 24 38 21
## [176] 29 19 20 42 41 37 36 18
# divide each number in the column vector
attitudeDiv <- lrn14$Attitude / 10

# create column 'attitude' by scaling the column "Attitude"
lrn14$attitude <- attitudeDiv

2.3 Combining variables

Our data includes many questions that can be thought to measure the same dimension. You can read more about the data and the variables here. Here we’ll combine multiple questions into combination variables. Useful functions for summation with data frames in R are

function description
colSums(df) returns a sum of each column in df
rowSums(df) returns a sum of each row in df
colMeans(df) returns the mean of each column in df
rowMeans(df) returns the mean of each row in df

We'll combine the use of rowMeans() with the select() function from the dplyr library to average the answers of selected questions. See how it is done in the example code.

Instructions

  • Access the dplyr library
  • Execute the example codes to create the combination variables ‘deep’ and ‘surf’ as columns in lrn14
  • Select the columns related to strategic learning from lrn14
  • Create the combination variable ‘stra’ as a column in lrn14

Hints: - Columns related to strategic learning are in the object strategic_questions. Use it for selecting the correct columns. - Use the function rowMeans() identically to the examples

R code

# Work with the exercise in this chunk, step-by-step. Fix the R code!
# lrn14 is available

# Access the dplyr library
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# questions related to deep, surface and strategic learning
deep_questions <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30","D06",  "D15", "D23", "D31")
surface_questions <- c("SU02","SU10","SU18","SU26", "SU05","SU13","SU21","SU29","SU08","SU16","SU24","SU32")
strategic_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20","ST28")

# select the columns related to deep learning 
deep_columns <- select(lrn14, one_of(deep_questions))
# and create column 'deep' by averaging
lrn14$deep <- rowMeans(deep_columns)

# select the columns related to surface learning 
surface_columns <- select(lrn14, one_of(surface_questions))
# and create column 'surf' by averaging
lrn14$surf <- rowMeans(surface_columns)

# select the columns related to strategic learning 
strategic_columns <- select(lrn14, one_of(strategic_questions))

# and create column 'stra' by averaging
lrn14$stra <- rowMeans(strategic_columns)

2.4 Selecting columns

Often it is convenient to work with only a certain column or a subset of columns of a bigger data frame. There are many ways to select columns of a data frame in R, and you saw one of them in the previous exercise: select() from dplyr.

dplyr is a popular library for data wrangling. There are also convenient data wrangling cheatsheets to help you get started (dplyr, tidyr etc.)

Instructions

  • Access the dplyr library
  • Create object keep_columns
  • Use select() (possibly together with one_of()) to create a new data frame learning2014 with the columns named in keep_columns.
  • Look at the structure of the new dataset

Hint: - See the previous exercise for help on how to select columns

R code

# Work with the exercise in this chunk, step-by-step. Fix the R code!

# lrn14 is available

# access the dplyr library
library(dplyr)

# choose a handful of columns to keep
keep_columns <- c("gender","Age","attitude", "deep", "stra", "surf", "Points")

# select the 'keep_columns' to create a new dataset
learning2014 <- lrn14[keep_columns]

# look at the dimensions of the new dataset
dim(learning2014)
## [1] 183   7

2.5 Modifying column names

Sometimes you want to rename your columns. You could do this by creating copies of the columns with new names, but you can also directly get and set the column names of a data frame using the function colnames().

The dplyr library has a rename() function, which can also be used. Remember the cheatsheets.

Instructions

  • Print out the column names of learning2014
  • Change the name of the second column to ‘age’
  • Change the name of ‘Points’ to ‘points’
  • Print out the column names again to see the changes

Hint: - You can use colnames() similarly to the example. Which index matches the column 'Points'?

R code

# Work with the exercise in this chunk, step-by-step. Fix the R code!
# learning2014 is available

# print out the column names of the data
colnames(learning2014)
## [1] "gender"   "Age"      "attitude" "deep"     "stra"     "surf"     "Points"
# change the name of the second column
colnames(learning2014)[2] <- "age"

# change the name of "Points" to "points"
colnames(learning2014)[7] <- "points"

# print out the new column names of the data
colnames(learning2014)
## [1] "gender"   "age"      "attitude" "deep"     "stra"     "surf"     "points"

2.6 Excluding observations

Often your data includes outliers or other observations which you wish to remove before further analysis. Or perhaps you simply wish to work with some subset of your data.

In the learning2014 data the variable 'points' denotes the student's exam points in a statistics course exam. If the student did not attend the exam, the value of 'points' is zero. We will remove these observations from the data.

Instructions

  • Access the dplyr library
  • As an example, create object male_students by selecting the male students from learning2014
  • Overwrite learning2014 by selecting only the rows where the 'points' variable is greater than zero.
  • If you do not remember how logical comparison works in R, see the ‘Logical comparison’ exercise from the course ‘R Short and Sweet’.

Hint: - The “greater than” logical operator is >

R code

# Work with the exercise in this chunk, step-by-step. Fix the R code!
# learning2014 is available

# access the dplyr library
library(dplyr)

# select male students
male_students <- filter(learning2014, gender == "M")

# select rows where points is greater than zero
learning2014 <- filter(learning2014, points > 0)

2.7 Visualizations with ggplot2

ggplot2 is a popular library for creating stunning graphics with R. It has some advantages over the basic plotting system in R, mainly consistent use of function arguments and flexible plot alteration. ggplot2 is an implementation of Leland Wilkinson’s Grammar of Graphics — a general scheme for data visualization.

In ggplot2, plots may be created via the convenience function qplot() where arguments and defaults are meant to be similar to base R's plot() function. More complex plotting capacity is available via ggplot(), which exposes the user to more explicit elements of the grammar. (from Wikipedia)

There is also a cheatsheet for data visualization with ggplot2.

Instructions

  • Access the ggplot2 library
  • Initialize the plot with data and aesthetic mappings
  • Adjust the plot initialization: Add an aesthetic element to the plot by defining col = gender inside aes().
  • Define the visualization type (points)
  • Draw the plot to see how it looks at this point
  • Add a regression line to the plot
  • Add the title “Student’s attitude versus exam points” with ggtitle("<insert title here>") to the plot with regression line
  • Draw the plot again to see the changes

Hints: - Use + to add the title to the plot - The plot with regression line is saved in the object p3 - You can draw the plot by typing the object name where the plot is saved

R code

# Work with the exercise in this chunk, step-by-step. Fix the R code!
# learning2014 is available

# Access the ggplot2 library
library(ggplot2)

# initialize plot with data and aesthetic mapping
p1 <- ggplot(learning2014, aes(x = attitude, y = points, col=gender))

# define the visualization type (points)
p2 <- p1 + geom_point()

# draw the plot
p2

# add a regression line
p3 <- p2 + geom_smooth(method = "lm")

# add a main title
p4 <- p3 + ggtitle("Student's attitude versus exam points")

# draw the plot!
p4
## `geom_smooth()` using formula = 'y ~ x'

2.8 Exploring a data frame

Often the most interesting features of your data are the relationships between the variables. If there are only a handful of variables saved as columns in a data frame, it is possible to visualize all of these relationships neatly in a single plot.

Base R offers a fast plotting function pairs(), which draws all possible scatter plots from the columns of a data frame, resulting in a scatter plot matrix. Libraries GGally and ggplot2 together offer a slow but more detailed look at the variables, their distributions and relationships.

Instructions

  • Draw a scatter matrix of the variables in learning2014 (other than gender)
  • Adjust the code: Add the argument col to the pairs() function, defining the colour with the ‘gender’ variable in learning2014.
  • Draw the plot again to see the changes.
  • Access the ggplot2 and GGally libraries and create the plot p with ggpairs().
  • Draw the plot. Note that the function is a bit slow.
  • Adjust the argument mapping of ggpairs() by defining col = gender inside aes().
  • Draw the plot again.
  • Adjust the code a little more: add another aesthetic element alpha = 0.3 inside aes().
  • See the difference between the plots?

Hints: - You can use $ to access a column of a data frame. - Remember to separate function arguments with a comma - You can draw the plot p by simply typing its name: just like printing R objects.

Looking at the data

We see that on average the students were about 25 years old, and most students seem to have a rather low attitude towards statistics. Given that the largest score in the exam was 33 points, we will assume that was the maximum score. If that is the case, most students seem to have done quite well, with the majority scoring in the twenties.
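
These observations can be checked numerically. A minimal sketch, assuming learning2014 is already in memory:

# numeric summary (min, quartiles, mean, max) of every column
summary(learning2014)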

R code

# Work with the exercise in this chunk, step-by-step. Fix the R code!
# learning2014 is available

# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(learning2014[-1])

# access the GGally and ggplot2 libraries
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

Plotting just the histograms

In order to facilitate the analysis, we plot the histograms of the different variables. We can see that the age distribution is indeed concentrated on the left of the plot, meaning that the vast majority of people were in their twenties. Attitude is clearly centred, looking similar to a normal distribution. Similar behaviour can be seen for Deep, Stra and Surf, although it is important to note that Stra tends more towards a uniform distribution than the others. Finally, as mentioned before, we can see that the majority of people got a relatively good grade, in the twenties.

# draw histograms of the numeric variables in a 2x3 grid
par(mfrow=c(2,3))
hist(learning2014$age,prob=T,main="Age")
hist(learning2014$attitude,prob=T,main="Attitude")
hist(learning2014$deep,prob=T,main="Deep")
hist(learning2014$stra,prob=T,main="Stra")
hist(learning2014$surf,prob=T,main="Surf")
hist(learning2014$points,prob=T,main="Points")

Variable selection

From this, we can already make a selection of the most interesting variables. Age does not seem to correlate with the points column, whereas Attitude shows a strong correlation. It is followed by Stra, and then Surf, which seems to have an inverse relation to our target variable. Therefore, we choose Attitude, Stra and Surf for our model. We can double-check our choices by directly computing the correlation of each variable with the target; as the previous plots suggested, our selection has the strongest (inverse) associations.
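
A minimal sketch of that check, assuming learning2014 (with the renamed columns) is in memory:

# correlation of each candidate explanatory variable with the exam points
cor(learning2014[, c("age", "attitude", "deep", "stra", "surf")], learning2014$points)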

2.9 Simple regression

Regression analysis with R is easy once you have your data in a neat data frame. You can simply use the lm() function to fit a linear model. The first argument of lm() is a formula, which defines the target variable and the explanatory variable(s).

The formula should be y ~ x, where y is the target (or outcome) variable and x the explanatory variable (predictor). The second argument of lm() is data, which should be a data frame where y and x are columns.

The output of lm() is a linear model object, which can be saved for later use. The generic function summary() can be used to print out a summary of the model.

Instructions

  • Create a scatter plot of ‘points’ versus ‘attitude’.
  • Fit a regression model where ‘points’ is the target and ‘attitude’ is the explanatory variable
  • Print out the summary of the linear model object

Hints: - Replace 1 with the name of the explanatory variable in the formula inside lm() - Use summary() on the model object to print out a summary

R code

# Work with the exercise in this chunk, step-by-step. Fix the R code!
# learning2014 is available

# a scatter plot of points versus attitude
library(ggplot2)
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'

# fit a linear model
my_model <- lm(points ~ attitude, data = learning2014)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

2.10 Multiple regression

When there is more than one explanatory variable in the linear model, it is called multiple regression. In R, it is easy to include several explanatory variables in your linear model: simply define more explanatory variables in the formula argument of lm(), as below

y ~ x1 + x2 + ..

Here y is again the target variable and x1, x2, .. are the explanatory variables.

Instructions

  • Draw a plot matrix of the learning2014 data with ggpairs()
  • Fit a regression model where points is the target variable and both attitude and stra are the explanatory variables.
  • Print out a summary of the model.
  • Adjust the code: Add one more explanatory variable to the model. Based on the plot matrix, choose the variable with the third highest (absolute) correlation with the target variable and use that as the third variable.
  • Print out a summary of the new model.

Hint: - The variable with the third highest absolute correlation with points is surf.

R code

# Work with the exercise in this chunk, step-by-step. Fix the R code!
# learning2014 is available

library(GGally)
library(ggplot2)
# create a plot matrix with ggpairs()
ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20)))

# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = learning2014)

# print out a summary of the model
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
# create a regression model with multiple explanatory variables
my_model3 <- lm(points ~ attitude + stra + surf, data = learning2014)

# print out a summary of the model
summary(my_model3)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Looking at the model summary

Looking at the summaries of the models, we can see that attitude indeed has the lowest p-value. We also see that for both stra and surf the p-values are rather large (above the usual 0.05 threshold), meaning that our model would probably perform better if we were to use attitude alone.
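
One way to check this is to compare the adjusted R-squared values of the candidate models. A minimal sketch, assuming the model objects fitted above are still available:

# adjusted R-squared: attitude alone vs. the two larger models
summary(my_model)$adj.r.squared
summary(my_model2)$adj.r.squared
summary(my_model3)$adj.r.squared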

2.11 Graphical model validation

R makes it easy to graphically explore the validity of your model assumptions. If you give a linear model object as the first argument to the plot() function, the function automatically assumes you want diagnostic plots and will produce them. You can check the help page of plotting an lm object by typing ?plot.lm or help(plot.lm) to the R console.

In the plot function you can then use the argument which to choose which plots you want. which must be an integer vector corresponding to the following list of plots:

which graphic
1 Residuals vs Fitted values
2 Normal QQ-plot
3 Standardized residuals vs Fitted values
4 Cook’s distances
5 Residuals vs Leverage
6 Cook’s distance vs Leverage


We will focus on plots 1, 2 and 5: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.

Instructions

  • Create the linear model object my_model2
  • Produce the following diagnostic plots using the plot() function: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage using the argument which.
  • Before the call to the plot() function, add the following: par(mfrow = c(2,2)). This will place the following 4 graphics in the same plot. Execute the code again to see the effect.

Hint: - You can combine integers to an integer vector with c(). For example: c(1,2,3).

R code

# Work with the exercise in this chunk, step-by-step. Fix the R code!
# learning2014 is available

# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = learning2014)

# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(my_model2, which=c(1,2,5))

Plot analysis

From these plots we can see that our model works well for points close to the mean. However, at both extremes the residuals are much larger.

2.12 Making predictions

Okay, so let’s assume that we have a linear model which seems to fit our standards. What can we do with it?

The model quantifies the relationship between the explanatory variable(s) and the dependent variable. The model can also be used for predicting the dependent variable based on new observations of the explanatory variable(s).

In R, predictions can be made with the predict() function (see ?predict). The first argument of predict() is a model object, and the argument newdata (a data.frame) can be used to make predictions based on new observations. One or more columns of newdata should have the same names as the explanatory variables in the model object.

Instructions

  • Create object m and print out a summary of the model
  • Create object new_attitudes
  • Adjust the code: Create a new data frame with a column named ‘attitude’ holding the new attitudes defined in new_attitudes
  • Print out the new data frame
  • Predict the new students' exam points based on their attitudes with predict(), using the newdata argument

Hints: - Type attitude = new_attitudes inside the data.frame() function. - Give the new_data data.frame as the newdata argument for predict()

R code

# Work with the exercise in this chunk, step-by-step. Fix the R code!
# learning2014 is available

# Create model object m
m <- lm(points ~ attitude, data = learning2014)

# print out a summary of the model
summary(m)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
# New observations
new_attitudes <- c("Mia" = 3.8, "Mike"= 4.4, "Riikka" = 2.2, "Pekka" = 2.9)
new_data <- data.frame(attitude = new_attitudes)

# Print out the new data
new_data
##        attitude
## Mia         3.8
## Mike        4.4
## Riikka      2.2
## Pekka       2.9
# Predict the new students' exam points based on attitude
predict(m, newdata = new_data)
##      Mia     Mike   Riikka    Pekka 
## 25.03390 27.14918 19.39317 21.86099

Awesome work!


3 Analysis

As before, I prefer reading the data from the URL, just in case I made a mistake in the previous (data wrangling) step.
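
A minimal sketch of that step; the URL below is assumed to point to the course's copy of the joined student alcohol consumption data (adjust the path and separator to wherever your wrangled file lives):

# read the joined alc data from the web (assumed URL; use your own file if it differs)
alc <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv")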

This plot gives us some insight on the variables of the dataset.

library(tidyr); library(dplyr); library(ggplot2)

# draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free")

From the previous plots we select "sex", "absences", "reason" and "traveltime". My initial reasoning for selecting them is:

sex: the differences between males and females (I assume males drink more alcohol, or feel more inclined to report it).

absences: maybe we could see people who like to party more, or who simply have a more carefree lifestyle.

reason: the reason for choosing the school could also reflect lifestyle.

traveltime: perhaps the travel time could point to differences in wealth, which I would think could also affect alcohol consumption.

# draw a bar plot for alc variable
alc %>% select(alc_use) %>% gather %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free")

# draw a bar plot of each variable
alc %>% select(sex, absences, reason, traveltime) %>% gather %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free")

For comparing the selected variables against alcohol use I decided to use the CrossTable() function from the gmodels package. The first variable we look at is sex. Both sexes show rather considerable chi-square contributions, meaning that this variable indeed seems interesting for our modelling purposes.

If the formatting is broken, I suggest opening the tables in a new window. This order was selected because some of the variables have a large number of levels.

library(gmodels)
CrossTable(alc$sex, alc$alc_use)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  370 
## 
##  
##              | alc$alc_use 
##      alc$sex |         1 |       1.5 |         2 |       2.5 |         3 |       3.5 |         4 |       4.5 |         5 | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##            F |        87 |        41 |        26 |        25 |        11 |         3 |         1 |         0 |         1 |       195 | 
##              |     2.367 |     1.831 |     0.418 |     0.532 |     2.040 |     3.964 |     2.954 |     1.581 |     2.954 |           | 
##              |     0.446 |     0.210 |     0.133 |     0.128 |     0.056 |     0.015 |     0.005 |     0.000 |     0.005 |     0.527 | 
##              |     0.621 |     0.651 |     0.464 |     0.610 |     0.344 |     0.176 |     0.111 |     0.000 |     0.111 |           | 
##              |     0.235 |     0.111 |     0.070 |     0.068 |     0.030 |     0.008 |     0.003 |     0.000 |     0.003 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##            M |        53 |        22 |        30 |        16 |        21 |        14 |         8 |         3 |         8 |       175 | 
##              |     2.638 |     2.040 |     0.466 |     0.593 |     2.273 |     4.417 |     3.292 |     1.762 |     3.292 |           | 
##              |     0.303 |     0.126 |     0.171 |     0.091 |     0.120 |     0.080 |     0.046 |     0.017 |     0.046 |     0.473 | 
##              |     0.379 |     0.349 |     0.536 |     0.390 |     0.656 |     0.824 |     0.889 |     1.000 |     0.889 |           | 
##              |     0.143 |     0.059 |     0.081 |     0.043 |     0.057 |     0.038 |     0.022 |     0.008 |     0.022 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total |       140 |        63 |        56 |        41 |        32 |        17 |         9 |         3 |         9 |       370 | 
##              |     0.378 |     0.170 |     0.151 |     0.111 |     0.086 |     0.046 |     0.024 |     0.008 |     0.024 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 
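
The cell values above are per-cell chi-square contributions. To get an overall test of independence between sex and alcohol use, one could also run the chi-squared test directly. A minimal sketch, assuming alc is in memory (a warning about small expected counts is likely, since several cells are sparse):

# overall chi-squared test of independence between sex and alcohol use
chisq.test(table(alc$sex, alc$alc_use))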

A chi-square contribution of 0 would mean that a cell's observed count matches exactly what independence would predict. To my surprise, the highest alcohol consumption seems to have a larger dependency on the lower absence counts. The variables do seem related, just not in the way I expected.

library(gmodels)
CrossTable(alc$absences, alc$alc_use)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  370 
## 
##  
##              | alc$alc_use 
## alc$absences |         1 |       1.5 |         2 |       2.5 |         3 |       3.5 |         4 |       4.5 |         5 | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##            0 |        31 |        10 |         9 |         6 |         4 |         3 |         0 |         0 |         0 |        63 | 
##              |     2.152 |     0.049 |     0.030 |     0.138 |     0.385 |     0.004 |     1.532 |     0.511 |     1.532 |           | 
##              |     0.492 |     0.159 |     0.143 |     0.095 |     0.063 |     0.048 |     0.000 |     0.000 |     0.000 |     0.170 | 
##              |     0.221 |     0.159 |     0.161 |     0.146 |     0.125 |     0.176 |     0.000 |     0.000 |     0.000 |           | 
##              |     0.084 |     0.027 |     0.024 |     0.016 |     0.011 |     0.008 |     0.000 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##            1 |        18 |        10 |         9 |         5 |         6 |         2 |         0 |         0 |         0 |        50 | 
##              |     0.045 |     0.260 |     0.271 |     0.053 |     0.649 |     0.038 |     1.216 |     0.405 |     1.216 |           | 
##              |     0.360 |     0.200 |     0.180 |     0.100 |     0.120 |     0.040 |     0.000 |     0.000 |     0.000 |     0.135 | 
##              |     0.129 |     0.159 |     0.161 |     0.122 |     0.188 |     0.118 |     0.000 |     0.000 |     0.000 |           | 
##              |     0.049 |     0.027 |     0.024 |     0.014 |     0.016 |     0.005 |     0.000 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##            2 |        24 |         8 |         8 |         3 |         6 |         2 |         3 |         0 |         2 |        56 | 
##              |     0.373 |     0.247 |     0.027 |     1.656 |     0.276 |     0.128 |     1.969 |     0.454 |     0.299 |           | 
##              |     0.429 |     0.143 |     0.143 |     0.054 |     0.107 |     0.036 |     0.054 |     0.000 |     0.036 |     0.151 | 
##              |     0.171 |     0.127 |     0.143 |     0.073 |     0.188 |     0.118 |     0.333 |     0.000 |     0.222 |           | 
##              |     0.065 |     0.022 |     0.022 |     0.008 |     0.016 |     0.005 |     0.008 |     0.000 |     0.005 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##            3 |        21 |         6 |         4 |         4 |         1 |         0 |         1 |         0 |         1 |        38 | 
##              |     3.049 |     0.034 |     0.533 |     0.011 |     1.591 |     1.746 |     0.006 |     0.308 |     0.006 |           | 
##              |     0.553 |     0.158 |     0.105 |     0.105 |     0.026 |     0.000 |     0.026 |     0.000 |     0.026 |     0.103 | 
##              |     0.150 |     0.095 |     0.071 |     0.098 |     0.031 |     0.000 |     0.111 |     0.000 |     0.111 |           | 
##              |     0.057 |     0.016 |     0.011 |     0.011 |     0.003 |     0.000 |     0.003 |     0.000 |     0.003 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##            4 |        14 |         5 |         5 |         5 |         2 |         3 |         0 |         0 |         1 |        35 | 
##              |     0.043 |     0.154 |     0.017 |     0.324 |     0.348 |     1.205 |     0.851 |     0.284 |     0.026 |           | 
##              |     0.400 |     0.143 |     0.143 |     0.143 |     0.057 |     0.086 |     0.000 |     0.000 |     0.029 |     0.095 | 
##              |     0.100 |     0.079 |     0.089 |     0.122 |     0.062 |     0.176 |     0.000 |     0.000 |     0.111 |           | 
##              |     0.038 |     0.014 |     0.014 |     0.014 |     0.005 |     0.008 |     0.000 |     0.000 |     0.003 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##            5 |         4 |         7 |         5 |         2 |         2 |         1 |         0 |         0 |         1 |        22 | 
##              |     2.246 |     2.827 |     0.838 |     0.079 |     0.005 |     0.000 |     0.535 |     0.178 |     0.404 |           | 
##              |     0.182 |     0.318 |     0.227 |     0.091 |     0.091 |     0.045 |     0.000 |     0.000 |     0.045 |     0.059 | 
##              |     0.029 |     0.111 |     0.089 |     0.049 |     0.062 |     0.059 |     0.000 |     0.000 |     0.111 |           | 
##              |     0.011 |     0.019 |     0.014 |     0.005 |     0.005 |     0.003 |     0.000 |     0.000 |     0.003 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##            6 |        10 |         3 |         3 |         3 |         1 |         0 |         1 |         0 |         0 |        21 | 
##              |     0.531 |     0.093 |     0.010 |     0.195 |     0.367 |     0.965 |     0.468 |     0.170 |     0.511 |           | 
##              |     0.476 |     0.143 |     0.143 |     0.143 |     0.048 |     0.000 |     0.048 |     0.000 |     0.000 |     0.057 | 
##              |     0.071 |     0.048 |     0.054 |     0.073 |     0.031 |     0.000 |     0.111 |     0.000 |     0.000 |           | 
##              |     0.027 |     0.008 |     0.008 |     0.008 |     0.003 |     0.000 |     0.003 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##            7 |         3 |         2 |         4 |         1 |         0 |         1 |         1 |         0 |         0 |        12 | 
##              |     0.523 |     0.001 |     2.626 |     0.082 |     1.038 |     0.365 |     1.718 |     0.097 |     0.292 |           | 
##              |     0.250 |     0.167 |     0.333 |     0.083 |     0.000 |     0.083 |     0.083 |     0.000 |     0.000 |     0.032 | 
##              |     0.021 |     0.032 |     0.071 |     0.024 |     0.000 |     0.059 |     0.111 |     0.000 |     0.000 |           | 
##              |     0.008 |     0.005 |     0.011 |     0.003 |     0.000 |     0.003 |     0.003 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##            8 |         6 |         4 |         4 |         1 |         2 |         1 |         1 |         1 |         0 |        20 | 
##              |     0.325 |     0.104 |     0.313 |     0.667 |     0.042 |     0.007 |     0.542 |     4.329 |     0.486 |           | 
##              |     0.300 |     0.200 |     0.200 |     0.050 |     0.100 |     0.050 |     0.050 |     0.050 |     0.000 |     0.054 | 
##              |     0.043 |     0.063 |     0.071 |     0.024 |     0.062 |     0.059 |     0.111 |     0.333 |     0.000 |           | 
##              |     0.016 |     0.011 |     0.011 |     0.003 |     0.005 |     0.003 |     0.003 |     0.003 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##            9 |         2 |         2 |         1 |         3 |         1 |         1 |         0 |         0 |         1 |        11 | 
##              |     1.123 |     0.009 |     0.266 |     2.603 |     0.002 |     0.484 |     0.268 |     0.089 |     2.005 |           | 
##              |     0.182 |     0.182 |     0.091 |     0.273 |     0.091 |     0.091 |     0.000 |     0.000 |     0.091 |     0.030 | 
##              |     0.014 |     0.032 |     0.018 |     0.073 |     0.031 |     0.059 |     0.000 |     0.000 |     0.111 |           | 
##              |     0.005 |     0.005 |     0.003 |     0.008 |     0.003 |     0.003 |     0.000 |     0.000 |     0.003 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           10 |         4 |         0 |         1 |         0 |         0 |         1 |         0 |         0 |         1 |         7 | 
##              |     0.689 |     1.192 |     0.003 |     0.776 |     0.605 |     1.431 |     0.170 |     0.057 |     4.043 |           | 
##              |     0.571 |     0.000 |     0.143 |     0.000 |     0.000 |     0.143 |     0.000 |     0.000 |     0.143 |     0.019 | 
##              |     0.029 |     0.000 |     0.018 |     0.000 |     0.000 |     0.059 |     0.000 |     0.000 |     0.111 |           | 
##              |     0.011 |     0.000 |     0.003 |     0.000 |     0.000 |     0.003 |     0.000 |     0.000 |     0.003 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           11 |         0 |         2 |         0 |         2 |         1 |         0 |         0 |         0 |         0 |         5 | 
##              |     1.892 |     1.550 |     0.757 |     3.774 |     0.745 |     0.230 |     0.122 |     0.041 |     0.122 |           | 
##              |     0.000 |     0.400 |     0.000 |     0.400 |     0.200 |     0.000 |     0.000 |     0.000 |     0.000 |     0.014 | 
##              |     0.000 |     0.032 |     0.000 |     0.049 |     0.031 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.005 |     0.000 |     0.005 |     0.003 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           12 |         1 |         0 |         2 |         0 |         2 |         0 |         2 |         0 |         0 |         7 | 
##              |     1.026 |     1.192 |     0.835 |     0.776 |     3.213 |     0.322 |    19.662 |     0.057 |     0.170 |           | 
##              |     0.143 |     0.000 |     0.286 |     0.000 |     0.286 |     0.000 |     0.286 |     0.000 |     0.000 |     0.019 | 
##              |     0.007 |     0.000 |     0.036 |     0.000 |     0.062 |     0.000 |     0.222 |     0.000 |     0.000 |           | 
##              |     0.003 |     0.000 |     0.005 |     0.000 |     0.005 |     0.000 |     0.005 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           13 |         0 |         1 |         0 |         0 |         0 |         0 |         0 |         0 |         1 |         2 | 
##              |     0.757 |     1.277 |     0.303 |     0.222 |     0.173 |     0.092 |     0.049 |     0.016 |    18.604 |           | 
##              |     0.000 |     0.500 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.500 |     0.005 | 
##              |     0.000 |     0.016 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.111 |           | 
##              |     0.000 |     0.003 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.003 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           14 |         0 |         0 |         1 |         3 |         0 |         0 |         0 |         2 |         1 |         7 | 
##              |     2.649 |     1.192 |     0.003 |     6.378 |     0.605 |     0.322 |     0.170 |    66.533 |     4.043 |           | 
##              |     0.000 |     0.000 |     0.143 |     0.429 |     0.000 |     0.000 |     0.000 |     0.286 |     0.143 |     0.019 | 
##              |     0.000 |     0.000 |     0.018 |     0.073 |     0.000 |     0.000 |     0.000 |     0.667 |     0.111 |           | 
##              |     0.000 |     0.000 |     0.003 |     0.008 |     0.000 |     0.000 |     0.000 |     0.005 |     0.003 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           16 |         0 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         0 |         1 | 
##              |     0.378 |     0.170 |     0.151 |     0.111 |     9.649 |     0.046 |     0.024 |     0.008 |     0.024 |           | 
##              |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.003 | 
##              |     0.000 |     0.000 |     0.000 |     0.000 |     0.031 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.000 |     0.000 |     0.000 |     0.003 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           17 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         0 |         0 |         1 | 
##              |     0.378 |     0.170 |     0.151 |     7.135 |     0.086 |     0.046 |     0.024 |     0.008 |     0.024 |           | 
##              |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.003 | 
##              |     0.000 |     0.000 |     0.000 |     0.024 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.000 |     0.000 |     0.003 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           18 |         0 |         1 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         2 | 
##              |     0.757 |     1.277 |     0.303 |     0.222 |     0.173 |     8.974 |     0.049 |     0.016 |     0.049 |           | 
##              |     0.000 |     0.500 |     0.000 |     0.000 |     0.000 |     0.500 |     0.000 |     0.000 |     0.000 |     0.005 | 
##              |     0.000 |     0.016 |     0.000 |     0.000 |     0.000 |     0.059 |     0.000 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.003 |     0.000 |     0.000 |     0.000 |     0.003 |     0.000 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           19 |         0 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         0 |         1 | 
##              |     0.378 |     0.170 |     0.151 |     0.111 |     9.649 |     0.046 |     0.024 |     0.008 |     0.024 |           | 
##              |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.003 | 
##              |     0.000 |     0.000 |     0.000 |     0.000 |     0.031 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.000 |     0.000 |     0.000 |     0.003 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           20 |         0 |         2 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         2 | 
##              |     0.757 |     8.087 |     0.303 |     0.222 |     0.173 |     0.092 |     0.049 |     0.016 |     0.049 |           | 
##              |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.005 | 
##              |     0.000 |     0.032 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.005 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           21 |         1 |         0 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         2 | 
##              |     0.078 |     0.341 |     0.303 |     0.222 |     0.173 |     8.974 |     0.049 |     0.016 |     0.049 |           | 
##              |     0.500 |     0.000 |     0.000 |     0.000 |     0.000 |     0.500 |     0.000 |     0.000 |     0.000 |     0.005 | 
##              |     0.007 |     0.000 |     0.000 |     0.000 |     0.000 |     0.059 |     0.000 |     0.000 |     0.000 |           | 
##              |     0.003 |     0.000 |     0.000 |     0.000 |     0.000 |     0.003 |     0.000 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           26 |         0 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         0 |         1 | 
##              |     0.378 |     0.170 |     0.151 |     0.111 |     9.649 |     0.046 |     0.024 |     0.008 |     0.024 |           | 
##              |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.003 | 
##              |     0.000 |     0.000 |     0.000 |     0.000 |     0.031 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.000 |     0.000 |     0.000 |     0.003 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           27 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         0 |         0 |         1 | 
##              |     0.378 |     0.170 |     0.151 |     7.135 |     0.086 |     0.046 |     0.024 |     0.008 |     0.024 |           | 
##              |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.003 | 
##              |     0.000 |     0.000 |     0.000 |     0.024 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.000 |     0.000 |     0.003 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           29 |         0 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         0 |         1 | 
##              |     0.378 |     0.170 |     0.151 |     0.111 |     9.649 |     0.046 |     0.024 |     0.008 |     0.024 |           | 
##              |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.003 | 
##              |     0.000 |     0.000 |     0.000 |     0.000 |     0.031 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.000 |     0.000 |     0.000 |     0.003 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           44 |         0 |         0 |         0 |         1 |         0 |         0 |         0 |         0 |         0 |         1 | 
##              |     0.378 |     0.170 |     0.151 |     7.135 |     0.086 |     0.046 |     0.024 |     0.008 |     0.024 |           | 
##              |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.003 | 
##              |     0.000 |     0.000 |     0.000 |     0.024 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.000 |     0.000 |     0.003 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##           45 |         1 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         1 | 
##              |     1.021 |     0.170 |     0.151 |     0.111 |     0.086 |     0.046 |     0.024 |     0.008 |     0.024 |           | 
##              |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.003 | 
##              |     0.007 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
##              |     0.003 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total |       140 |        63 |        56 |        41 |        32 |        17 |         9 |         3 |         9 |       370 | 
##              |     0.378 |     0.170 |     0.151 |     0.111 |     0.086 |     0.046 |     0.024 |     0.008 |     0.024 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 

For the reason variable, the categories seem mostly independent of alcohol use. However, some cells show a higher chi-square contribution, such as reputation combined with a consumption level of 2.

library(gmodels)
CrossTable(alc$reason, alc$alc_use)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  370 
## 
##  
##              | alc$alc_use 
##   alc$reason |         1 |       1.5 |         2 |       2.5 |         3 |       3.5 |         4 |       4.5 |         5 | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##       course |        53 |        26 |        18 |        12 |        12 |         7 |         3 |         1 |         3 |       135 | 
##              |     0.072 |     0.395 |     0.290 |     0.585 |     0.009 |     0.102 |     0.025 |     0.008 |     0.025 |           | 
##              |     0.393 |     0.193 |     0.133 |     0.089 |     0.089 |     0.052 |     0.022 |     0.007 |     0.022 |     0.365 | 
##              |     0.379 |     0.413 |     0.321 |     0.293 |     0.375 |     0.412 |     0.333 |     0.333 |     0.333 |           | 
##              |     0.143 |     0.070 |     0.049 |     0.032 |     0.032 |     0.019 |     0.008 |     0.003 |     0.008 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##         home |        39 |        15 |        13 |        15 |        11 |         4 |         2 |         1 |         3 |       103 | 
##              |     0.000 |     0.367 |     0.430 |     1.127 |     0.491 |     0.113 |     0.102 |     0.033 |     0.098 |           | 
##              |     0.379 |     0.146 |     0.126 |     0.146 |     0.107 |     0.039 |     0.019 |     0.010 |     0.029 |     0.278 | 
##              |     0.279 |     0.238 |     0.232 |     0.366 |     0.344 |     0.235 |     0.222 |     0.333 |     0.333 |           | 
##              |     0.105 |     0.041 |     0.035 |     0.041 |     0.030 |     0.011 |     0.005 |     0.003 |     0.008 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##        other |         9 |         4 |         5 |         5 |         3 |         4 |         1 |         1 |         2 |        34 | 
##              |     1.161 |     0.553 |     0.004 |     0.403 |     0.001 |     3.804 |     0.036 |     1.903 |     1.664 |           | 
##              |     0.265 |     0.118 |     0.147 |     0.147 |     0.088 |     0.118 |     0.029 |     0.029 |     0.059 |     0.092 | 
##              |     0.064 |     0.063 |     0.089 |     0.122 |     0.094 |     0.235 |     0.111 |     0.333 |     0.222 |           | 
##              |     0.024 |     0.011 |     0.014 |     0.014 |     0.008 |     0.011 |     0.003 |     0.003 |     0.005 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##   reputation |        39 |        18 |        20 |         9 |         6 |         2 |         3 |         0 |         1 |        98 | 
##              |     0.099 |     0.103 |     1.800 |     0.318 |     0.723 |     1.391 |     0.159 |     0.795 |     0.803 |           | 
##              |     0.398 |     0.184 |     0.204 |     0.092 |     0.061 |     0.020 |     0.031 |     0.000 |     0.010 |     0.265 | 
##              |     0.279 |     0.286 |     0.357 |     0.220 |     0.188 |     0.118 |     0.333 |     0.000 |     0.111 |           | 
##              |     0.105 |     0.049 |     0.054 |     0.024 |     0.016 |     0.005 |     0.008 |     0.000 |     0.003 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total |       140 |        63 |        56 |        41 |        32 |        17 |         9 |         3 |         9 |       370 | 
##              |     0.378 |     0.170 |     0.151 |     0.111 |     0.086 |     0.046 |     0.024 |     0.008 |     0.024 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 

As expected, there does indeed seem to be some relationship between travel time and alcohol consumption. We see that as the travel time increases, the totals decrease.

library(gmodels)
CrossTable(alc$traveltime, alc$alc_use)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  370 
## 
##  
##                | alc$alc_use 
## alc$traveltime |         1 |       1.5 |         2 |       2.5 |         3 |       3.5 |         4 |       4.5 |         5 | Row Total | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##              1 |        95 |        42 |        37 |        24 |        21 |        11 |         6 |         1 |         5 |       242 | 
##                |     0.129 |     0.015 |     0.004 |     0.296 |     0.000 |     0.001 |     0.002 |     0.472 |     0.134 |           | 
##                |     0.393 |     0.174 |     0.153 |     0.099 |     0.087 |     0.045 |     0.025 |     0.004 |     0.021 |     0.654 | 
##                |     0.679 |     0.667 |     0.661 |     0.585 |     0.656 |     0.647 |     0.667 |     0.333 |     0.556 |           | 
##                |     0.257 |     0.114 |     0.100 |     0.065 |     0.057 |     0.030 |     0.016 |     0.003 |     0.014 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##              2 |        40 |        16 |        17 |        14 |         4 |         4 |         1 |         1 |         2 |        99 | 
##                |     0.172 |     0.044 |     0.271 |     0.837 |     2.431 |     0.066 |     0.823 |     0.048 |     0.069 |           | 
##                |     0.404 |     0.162 |     0.172 |     0.141 |     0.040 |     0.040 |     0.010 |     0.010 |     0.020 |     0.268 | 
##                |     0.286 |     0.254 |     0.304 |     0.341 |     0.125 |     0.235 |     0.111 |     0.333 |     0.222 |           | 
##                |     0.108 |     0.043 |     0.046 |     0.038 |     0.011 |     0.011 |     0.003 |     0.003 |     0.005 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##              3 |         4 |         4 |         2 |         3 |         6 |         1 |         1 |         0 |         0 |        21 | 
##                |     1.960 |     0.050 |     0.437 |     0.195 |     9.638 |     0.001 |     0.468 |     0.170 |     0.511 |           | 
##                |     0.190 |     0.190 |     0.095 |     0.143 |     0.286 |     0.048 |     0.048 |     0.000 |     0.000 |     0.057 | 
##                |     0.029 |     0.063 |     0.036 |     0.073 |     0.188 |     0.059 |     0.111 |     0.000 |     0.000 |           | 
##                |     0.011 |     0.011 |     0.005 |     0.008 |     0.016 |     0.003 |     0.003 |     0.000 |     0.000 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##              4 |         1 |         1 |         0 |         0 |         1 |         1 |         1 |         1 |         2 |         8 | 
##                |     1.357 |     0.096 |     1.211 |     0.886 |     0.137 |     1.088 |     3.333 |    13.482 |    16.750 |           | 
##                |     0.125 |     0.125 |     0.000 |     0.000 |     0.125 |     0.125 |     0.125 |     0.125 |     0.250 |     0.022 | 
##                |     0.007 |     0.016 |     0.000 |     0.000 |     0.031 |     0.059 |     0.111 |     0.333 |     0.222 |           | 
##                |     0.003 |     0.003 |     0.000 |     0.000 |     0.003 |     0.003 |     0.003 |     0.003 |     0.005 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##   Column Total |       140 |        63 |        56 |        41 |        32 |        17 |         9 |         3 |         9 |       370 | 
##                |     0.378 |     0.170 |     0.151 |     0.111 |     0.086 |     0.046 |     0.024 |     0.008 |     0.024 |           | 
## ---------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 
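
If one wanted a formal check of independence rather than eyeballing the chi-square contributions, a chi-square test could be applied to the same cross-tabulation. This is just a small extra step; several cells have very small expected counts, so the p-value should be taken with a grain of salt.

# formal chi-square test of independence between travel time and alcohol use
# (several cells have small expected counts, so interpret with caution)
chisq.test(table(alc$traveltime, alc$alc_use))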

Logistic regression

We now create a logistic regression model using our selected variables. The first thing to note is the z-statistics. Sex (shown as the dummy sexM, so females are the reference class) and absences seem to be the most relevant variables. I assume a single dummy for sex is fine, as we saw mostly very similar values when analysing the sexes previously. These are followed by travel time, as we indicated before. It is interesting to see that for some of the reason categories the p-value is high, meaning they are likely less useful for our model. However, of them the category “other” still seems to hold some contribution.

Regarding the odds ratios, a value of 1 would mean the two variables are independent. A value larger than 1 means the variable is associated with higher odds of high use, and a value lower than 1 means it is associated with lower odds. For our purposes I would argue that any value considerably away from 1 is useful for the model; after all, a strong indication in either direction helps guide the predictions. In our case, sex indeed has an odds ratio significantly larger than 1, with even its lower confidence limit above 1, meaning a clear association with our target. Absences, on the other hand, show a consistent association, but the distance from 1 is much smaller; this was not shown as clearly by the previous metric. Travel time shows a larger upper confidence limit, meaning that in some cases it gives us more information. The reason variable has the most interesting behavior in my opinion. Although it had some notable z-statistics before, we now see that its odds ratios are not particularly strong, and from the confidence intervals we see that the association is negative for some categories and positive for others. I think this actually means they are not good predictive variables, as their relationship with the target seems to change depending on the category. This would, in my opinion, make learning the model more complicated.

# find the model with glm()
m <- glm(high_use ~ sex + absences + traveltime + reason, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + absences + traveltime + reason, 
##     family = "binomial", data = alc)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.50233    0.38742  -6.459 1.05e-10 ***
## sexM              0.97973    0.25027   3.915 9.05e-05 ***
## absences          0.10076    0.02457   4.101 4.11e-05 ***
## traveltime        0.41903    0.16736   2.504   0.0123 *  
## reasonhome        0.12618    0.30888   0.409   0.6829    
## reasonother       0.89956    0.41784   2.153   0.0313 *  
## reasonreputation -0.36535    0.33099  -1.104   0.2697    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 401.49  on 363  degrees of freedom
## AIC: 415.49
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
##      (Intercept)             sexM         absences       traveltime 
##       -2.5023266        0.9797277        0.1007637        0.4190261 
##       reasonhome      reasonother reasonreputation 
##        0.1261831        0.8995610       -0.3653526
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                          OR     2.5 %    97.5 %
## (Intercept)      0.08189424 0.0373175 0.1710846
## sexM             2.66373089 1.6404768 4.3849742
## absences         1.10601531 1.0565900 1.1635523
## traveltime       1.52048007 1.0957873 2.1182805
## reasonhome       1.13448993 0.6171224 2.0781000
## reasonother      2.45852348 1.0792890 5.6012936
## reasonreputation 0.69395194 0.3583216 1.3181019
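
To put the absences odds ratio into perspective, note that it is per single absence, so the effect compounds over multiple absences. A small back-of-the-envelope calculation based on the model above:

# the odds ratio for absences is per one absence;
# ten additional absences multiply the odds of high use by about 2.7
exp(10 * coef(m)["absences"])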

With our model we can now make predictions. Here we take a predicted probability larger than 0.5 to mean TRUE, and anything lower to mean FALSE. The correct predictions are on the main diagonal of the table below, and the wrong predictions on the off-diagonal. In total we had 89 wrong predictions out of 370 observations, giving a loss of about 0.24. If we simply guessed at random, one might expect an error rate of around 50%, so compared to that we are doing fairly well.

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

library(dplyr)
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
## # A tibble: 10 × 6
##    failures absences sex   high_use probability prediction
##       <dbl>    <dbl> <chr> <lgl>          <dbl> <lgl>     
##  1        0        3 M     FALSE          0.436 FALSE     
##  2        1        0 M     FALSE          0.554 TRUE      
##  3        1        7 M     TRUE           0.537 TRUE      
##  4        0        1 F     FALSE          0.340 FALSE     
##  5        0        6 F     FALSE          0.268 FALSE     
##  6        1        2 F     FALSE          0.132 FALSE     
##  7        0        2 F     FALSE          0.132 FALSE     
##  8        0        3 F     FALSE          0.204 FALSE     
##  9        0        4 M     TRUE           0.430 FALSE     
## 10        0        2 M     TRUE           0.484 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   248   11
##    TRUE     78   33
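
As a quick sanity check of the numbers quoted above, we can also compute the number and share of misclassified observations directly from the prediction column (a small addition, nothing new analytically):

# number and share of wrong predictions (should match 89 / 370 ≈ 0.24)
sum(alc$high_use != alc$prediction)
mean(alc$high_use != alc$prediction)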

Since we noticed that the reason variable seems to be unreliable for the predictions, we try a new model without it. This model actually does slightly worse, missing 91 predictions, which corresponds to a loss of about 0.246. This could be because at least some of the reason categories did contribute in some cases, meaning a net loss when the variable is dropped.

m <- glm(high_use ~ sex + absences + traveltime, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

library(dplyr)
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
## # A tibble: 10 × 6
##    failures absences sex   high_use probability prediction
##       <dbl>    <dbl> <chr> <lgl>          <dbl> <lgl>     
##  1        0        3 M     FALSE          0.417 FALSE     
##  2        1        0 M     FALSE          0.347 FALSE     
##  3        1        7 M     TRUE           0.515 TRUE      
##  4        0        1 F     FALSE          0.177 FALSE     
##  5        0        6 F     FALSE          0.345 FALSE     
##  6        1        2 F     FALSE          0.137 FALSE     
##  7        0        2 F     FALSE          0.137 FALSE     
##  8        0        3 F     FALSE          0.207 FALSE     
##  9        0        4 M     TRUE           0.441 FALSE     
## 10        0        2 M     TRUE           0.492 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   249   10
##    TRUE     81   30

Cross-validation

We now use cross-validation with K = 10 to check how good our model actually is. The idea is to train and test on different splits of the data, in order to see how well the model reacts to examples it was not trained on. We see that the error is a bit higher than before, which is expected. However, our model seems to slightly outperform the model suggested in the exercises, which had a loss of about 0.26.

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2405405
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2486486
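
Since the folds are chosen at random, the cross-validation estimate itself varies a little from run to run. Out of curiosity, here is a minimal sketch (using the same loss_func, model m and data as above) that repeats the 10-fold cross-validation a few times to get a feel for that variability:

# repeat 10-fold cross-validation a few times to see how much the estimate varies
set.seed(2017)
cv_errors <- replicate(5, cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)$delta[1])
cv_errors
mean(cv_errors)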

4 Analysis

Data

Here we take a look at the data. This example dataset has 506 rows and 14 columns. The data itself is about housing values in the suburbs of Boston. For example, some variables represent the per capita crime rate by town (crim) and the average number of rooms per dwelling (rm).

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

Pairs

We can use the pairs function to take a further look at the data and the relationships between variables. This function plots each variable against every other. We can also see the discrete variables, for example “chas”, which takes values of 0 or 1; the plot then shows how the other variables relate to those two values. Some variables show a strong relationship, such as “lstat” and “medv”. This makes sense, as the former is the percentage of lower-status population and the latter is the median value of owner-occupied homes in $1000s. I think this plot also shows us why we need to standardize the dataset: the variables have very different scales.

# pairwise scatter plots of the Boston variables
pairs(Boston)

Summary

The different scales can be better seen here. We see for example that for “crim” the values stay between 0 and 100, whereas a variable such as “nox” stays between 0.3 and 0.9.

# summary of the Boston dataset
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Scale

We now use the scale function to scale (duh) the dataset. Although the variables do not all end up in exactly the same range, we see that they are now all of the same magnitude. The idea of standardizing the data like this is to center each variable around 0 with a standard deviation of 1. We can see this by summarizing the data again.

boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
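
As a quick check (a minimal sketch, not needed for the rest of the analysis), we can verify that scale() really applies the usual standardization, (x - mean(x)) / sd(x), column by column, for example on the crim variable:

# verify the standardization by hand for one column; should return TRUE
all.equal(boston_scaled$crim, (Boston$crim - mean(Boston$crim)) / sd(Boston$crim))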

Categorize

By using quantiles we can discretize the crime variable and turn it from numeric to categorical. Looking at the summary again, we can see the new crime variable.

data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
# use the quantiles of the scaled crime rate as cut points for a categorical variable
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)
# replace the numeric crim variable with the new crime factor
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
##        zn               indus              chas              nox         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723   Min.   :-1.4644  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723   1st Qu.:-0.9121  
##  Median :-0.48724   Median :-0.2109   Median :-0.2723   Median :-0.1441  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723   3rd Qu.: 0.5981  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv                     crime    
##  Min.   :-1.9063   [-0.419,-0.411]:127  
##  1st Qu.:-0.5989   (-0.411,-0.39] :126  
##  Median :-0.1449   (-0.39,0.00739]:126  
##  Mean   : 0.0000   (0.00739,9.92] :127  
##  3rd Qu.: 0.2683                        
##  Max.   : 2.9865
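
For reference, the exact cut points used for the new factor are simply the quantiles of the scaled crime rate computed above:

# the quantile cut points used for the crime factor
bins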

Fitting

Test and Train Split

For training and validation it is common to split the data into a training set and a test set. We now use 80% of the data for training and 20% for testing. Since we want to learn to predict the crime variable, we save the correct classes and then remove the variable from the test set.

# split the data: 80% of the rows for training, the remaining 20% for testing
n <- nrow(Boston)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
# save the correct classes and remove the target variable from the test set
correct_classes <- test$crime
test <- dplyr::select(test, -crime)

LDA

Linear discriminant analysis (LDA) is perhaps more commonly used for dimensionality reduction. The idea is to find a linear combination of features that can describe or differentiate two or more classes of objects or events. However, we can also use it as a linear classifier, which is what we do below.

We also cross-tabulate the correct classes against the predictions. The main diagonal represents the correct predictions. We can easily see that the majority of the predictions were correct.

# fit the LDA model on the training data and predict the classes of the test data
lda.fit <- lda(crime ~ ., data = train)
predictions <- predict(lda.fit, newdata = test)
# cross-tabulate the correct classes against the predictions
library(gmodels)
CrossTable(correct_classes, predictions$class)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  102 
## 
##  
##                 | predictions$class 
## correct_classes | [-0.419,-0.411] |  (-0.411,-0.39] | (-0.39,0.00739] |  (0.00739,9.92] |       Row Total | 
## ----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
## [-0.419,-0.411] |              16 |               7 |               1 |               0 |              24 | 
##                 |          20.716 |           0.012 |           2.030 |           7.529 |                 | 
##                 |           0.667 |           0.292 |           0.042 |           0.000 |           0.235 | 
##                 |           0.696 |           0.226 |           0.062 |           0.000 |                 | 
##                 |           0.157 |           0.069 |           0.010 |           0.000 |                 | 
## ----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
##  (-0.411,-0.39] |               6 |              14 |               3 |               0 |              23 | 
##                 |           0.128 |           7.029 |           0.102 |           7.216 |                 | 
##                 |           0.261 |           0.609 |           0.130 |           0.000 |           0.225 | 
##                 |           0.261 |           0.452 |           0.188 |           0.000 |                 | 
##                 |           0.059 |           0.137 |           0.029 |           0.000 |                 | 
## ----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
## (-0.39,0.00739] |               1 |              10 |              12 |               1 |              24 | 
##                 |           3.597 |           1.004 |          18.015 |           5.662 |                 | 
##                 |           0.042 |           0.417 |           0.500 |           0.042 |           0.235 | 
##                 |           0.043 |           0.323 |           0.750 |           0.031 |                 | 
##                 |           0.010 |           0.098 |           0.118 |           0.010 |                 | 
## ----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
##  (0.00739,9.92] |               0 |               0 |               0 |              31 |              31 | 
##                 |           6.990 |           9.422 |           4.863 |          46.538 |                 | 
##                 |           0.000 |           0.000 |           0.000 |           1.000 |           0.304 | 
##                 |           0.000 |           0.000 |           0.000 |           0.969 |                 | 
##                 |           0.000 |           0.000 |           0.000 |           0.304 |                 | 
## ----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
##    Column Total |              23 |              31 |              16 |              32 |             102 | 
##                 |           0.225 |           0.304 |           0.157 |           0.314 |                 | 
## ----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
## 
## 

Distances

Instead of using crime as a target, we can try to cluster the data using k-means. For that, we will use the distances between points. These can be computed with many different distance measures; for our purposes I believe the Euclidean distance will suffice.

library(MASS)
data("Boston")
# standardize the variables and compute the Euclidean distance matrix
boston_scaled <- scale(Boston)
dist <- dist(boston_scaled)
summary(dist)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
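
Just for comparison (the rest of the analysis sticks to the Euclidean distance), the same dist() function can compute other metrics as well, for example the Manhattan distance:

# alternative distance metric, only for comparison
dist_man <- dist(boston_scaled, method = "manhattan")
summary(dist_man)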

K-means

K-means is a clustering method that partitions the data into k different clusters. Since the method minimizes the within-cluster variance, we can see why I chose to look at the Euclidean distances: k-means effectively minimizes squared Euclidean distances.

I also plot the results for a reduced set of variables. We see that maybe 3 clusters was too many, as the vast majority of points fall into one cluster (red) or the other (black), but almost no points fall into the third cluster (green).

# k-means clustering
km <- kmeans(boston_scaled, centers = 3)

# plot the Boston dataset with clusters
pairs(Boston[c("rm", "age", "dis", "crim")], col = km$cluster)

For that reason I redo the k-means using only 2 centers. We see that the data has a clear separation between the clusters, without too much overlap. This is good, as one of the issues of k-means is that points can bleed from one cluster into another due to its tendency to create clusters of similar sizes.

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston[c("rm", "age", "dis", "crim")], col = km$cluster)

Bonus

We now perform LDA using the k-means clusters as targets instead of the crime variable. Since we will need a proper clustering for our predictions, I first do a small search over different numbers of clusters, with a fixed seed to remove the randomness of the initial cluster selection.

library(ggplot2)
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')

This metric (the total within-cluster sum of squares) can give us an indication of the number of clusters to use. We see a drastic drop around 3 clusters, so this is the value we will use next. It is worth noting that this goes against my earlier experiment above; however, now that the seed is set, our k-means should behave similarly to this last run.

library(MASS)
set.seed(123)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

km <- kmeans(boston_scaled, centers = 3)

boston_scaled$cluster <- km$cluster

# 80/20 train/test split, now using the k-means cluster as the target
ind <- sample(nrow(boston_scaled),  size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$cluster
test <- dplyr::select(test, -cluster)

# fit LDA with the cluster labels as the target
lda.fit = lda(cluster ~ ., data=train)

With our new model we can once again predict and look at a table of the results. And it looks good! The vast majority of points have been predicted correctly. We can also see what I mentioned before: the sizes of the clusters are remarkably similar, all of them around 30 points.

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##        predicted
## correct  1  2  3
##       1 30  5  0
##       2  0 31  5
##       3  0  2 29
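
As a single accuracy figure for the table above (a small addition), the share of correctly predicted clusters is simply the proportion of matching labels:

# overall accuracy of the cluster-based LDA on the test set
# (30 + 31 + 29 correct out of 102 ≈ 0.88)
mean(as.character(correct_classes) == as.character(lda.pred$class))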

We can also plot our results using arrows to represent the relationship of the original variables to the LDA solution. We see that both “rad” and “age” seem to be the major variables in this new space. Naturally the other variables have some effect, like “tax”, but these two seem to be the most influential. This is supported both by their lengths (a longer arrow meaning a larger contribution) and by the angle between them: an angle close to 90 degrees indicates almost no correlation between the two, which is ideal for spanning the space.

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$cluster)

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 2)

SUPER Bonus

We can now merge our different approaches and compare them. Both plots are in fact quite similar. However, I would argue that the k-means coloring is better, as in cluster 1 (purple) all points are actually grouped together in the left part of the plot, whereas in the previous plot we had a mix of different classes there. It is important to note, however, that I used our earlier choice of 3 clusters, while the previous coloring had more categories, so this difference might actually be quite small. Especially since the points colored there were in fact the next closest classes (high and med_high).

##################################
library(MASS)
boston_scaled <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/boston_scaled.txt",sep=",", header = T)
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))
ind <- sample(nrow(boston_scaled),  size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.fit = lda(crime ~ ., data=train)

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
model_predictors <- dplyr::select(train, -crime)
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
# And now with k-means
km <- kmeans(train %>% dplyr::select(-crime), centers = 3)
train$cluster <- km$cluster

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$cluster)

5 Analysis

As usual, even though I finished the data wrangling part myself, I would still rather use the dataset provided by the lecturer.

library(readr)
human <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Our first step is to move the country column to rownames.

# move the country names to rownames
library(tibble)
human <- column_to_rownames(human, "Country")

We can then take a closer look at the data. From the previous step, each country is now used as the row name, so each row refers to the data of that country.

We can also notice that some variables have much larger magnitudes than others. This can make further analysis difficult, so later on we will scale the data to address this.

head(human)
##               Edu2.FM   Labo.FM Life.Exp Edu.Exp   GNI Mat.Mor Ado.Birth
## Norway      1.0072389 0.8908297     81.6    17.5 64992       4       7.8
## Australia   0.9968288 0.8189415     82.4    20.2 42261       6      12.1
## Switzerland 0.9834369 0.8251001     83.0    15.8 56431       6       1.9
## Denmark     0.9886128 0.8840361     80.2    18.7 44025       5       5.1
## Netherlands 0.9690608 0.8286119     81.6    17.9 45435       6       6.2
## Germany     0.9927835 0.8072289     80.9    16.5 43919       7       3.8
##             Parli.F
## Norway         39.6
## Australia      30.5
## Switzerland    28.5
## Denmark        38.0
## Netherlands    36.9
## Germany        36.9
summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

I personally enjoy the ggpairs function for getting an idea of the data. On the diagonal we see the distribution of each variable. The lower panels show scatter plots of each pair of variables, and the upper panels show their correlations.

library(GGally)
ggpairs(human, progress = FALSE)

PCA

We now use PCA on our data without any further preprocessing. PCA linearly transforms the data into a new coordinate system. We can use this for dimensionality reduction, for example in machine learning tasks. As expected, GNI overwhelms all other variables in the representation due to its much larger magnitude.

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
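
To make the dominance of GNI concrete, one can look at the proportion of variance captured by each principal component of the unstandardized data (a small addition; the first component should account for essentially all of the variance, which is exactly the GNI effect described above):

# proportion of variance explained by each principal component (unscaled data)
summary(pca_human)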

However, we can redo the previous analysis on a scaled version of the data. We now see a much more even spread of vector magnitudes for the different variables.

For example, we see that variables which have a close relationship in real life end up with similar vectors. Look at GNI and secondary education: it is well known that wealthier countries also have higher levels of education. Interestingly, this relationship is not as clearly indicated by the plain correlation of these two variables. Other related variables are life expectancy and expected years of education. On the other side we have adolescent births and maternal mortality; these two variables have similar vectors, as they are strongly related statistically. The final two vectors relate to female participation in parliament and in the work force.

From that, my interpretation of these component dimensions is country wealth for PC1, and female participation (power) in both politics and the economy for PC2.

library(dplyr)
human_std <- scale(human)
human_std2 <- human %>% rename(`Female secondary education` = Edu2.FM, `Female in the work force` = Labo.FM, 
                     `Life expectancy` = Life.Exp, `Expected education` = Edu.Exp, `Gross income per capita` = GNI,
                      `Maternal mortality` = Mat.Mor, `Adolescent birth` = Ado.Birth, `Female participation in parliament` = Parli.F) %>% scale
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
# draw a biplot of the principal component representation and the original variables
pca_human2 <- prcomp(human_std2)
biplot(pca_human2, choices = 1:2, cex = c(0.1, 0.7), col = c("grey40", "deeppink2"))

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
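
To attach rough percentages to the axes of the standardized biplot, one can again extract the proportion of variance explained by each component (a small addition, using the pca_human object fitted on the standardized data above):

# variance explained by the principal components of the standardized data
s <- summary(pca_human)
round(100 * s$importance["Proportion of Variance", ], 1)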

MCA

For this next part we will make use of the “tea” dataset. We take a basic look at the data using summary and view. It is a fun dataset, covering different types of tea, how they are consumed, with or without sugar, where the tea was bought, and whether it is drunk at lunch.

library(FactoMineR)
tea_time <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea_time.csv", stringsAsFactors = TRUE)
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
view(tea_time)

Multiple correspondence analysis represents data in a low dimensional space. It can be seen as similar to PCA, but can be applied to categorical data.

For the factor map, we can interpret the distance between points as a measure of their similarity. For example, tea bags seem close to chain stores, whereas unpackaged tea seems close to tea shops. We see a similar behavior in the biplot.

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic",habillage = "quali")

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_mca_biplot(mca)
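
The eigenvalue table in the summary above can also be drawn as a scree plot, which makes it easier to see how much each MCA dimension explains. This is a small extra, using factoextra which is already loaded:

# scree plot of the MCA dimensions
fviz_screeplot(mca, addlabels = TRUE)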


(more chapters to be added similarly as we proceed with the course!)